Interleaved image support in tokenizers #1138

RdoubleA · 2024-07-02T21:01:57Z

Adding support for image special tokens in tokenizers and Message dataclass across the library. Starting with Llama3Tokenizer as an example, as multimodal Flamingo will be using this.

This PR will enabled support for multiple images per sample, interleaved throughout the text. This is done by overhauling how content is represented in the Message class:

[
    Message(role="user", content=[{"type": "image"}, {"type": "text", "content":"hello"}, ...]),
    ...
]

For backwards compatibility, passing a string into the content field will automatically get casted into the list of dictionaries. Using dictionaries is visually clunky, but it enables the most flexibility for any non-text content that needs to be tokenized differently and future metadata fields that may affect tokenization.

We must also use a different paradigm for chat formats. Now, treat it as extra tags that are prepended/appended to the Message content list based on the role:

for message in sample:
      content = (
          [{"type": "text", "content": cls.template[message.role][0]}]
          + message.content
          + [{"type": "text", "content": cls.template[message.role][1]}]
      )
      formatted_dialogue.append(
          Message(
              role=message.role,
              content=content,
              masked=message.masked,
              ipython=message.ipython,
              eot=message.eot,
          ),
      )

Changelog

Overhaul to Message class. Update Message content field to be multimodal - can have text or image content in a list of dictionaries. Added properties for easily checking if there's image content or getting the text only content
Update all chat format tests
Update all chat formats to use the new Message content structure. Now we simply prepend and append tags as needed
If Message content contains type==image, then add image special token in Llama3Tokenizer

Test Plan

Compared to a reference implementation token by token. Made this into new unit tests for text only messages, text + image, text + interleaved image, tool calls

Other considered approaches

There are other approaches I considered for enabling interleaved images in a message. The important workflow to consider is that users will have to use our utilities or create their own transforms to get their datasets into our Message format with interleaved image content. We should make this workflow as easy as possible.

1

Use an enum, Media.IMAGE, and make Message.content a List[Union[str, Media]] that combines texts and places images in between strings

[
    Message(role="user", content=[Media.IMAGE, "hello", Media.IMAGE, "world"]),
    ...
]

Pros: slightly less tedious to convert user datasets into this structure and also process this (versus dictionaries)
Cons: not easy to support image metadata or other fields that may affect image tokenization in the future

2

Keep everything as a string and allow special tokens in the text itself. Tokenizer splits on the indicated image token and adds the correct special token id and encodes the strings normally

[
    Message(role="user", content="<img> hello <img> world", media={"image": "<img>"})
    ...
]

Pros: More prompt template / chat format friendly, as we can still format the string as is. In other approaches, since the text is broken up, prompt templating is considerably more difficult. Easier to process string only content, no impact on text-only workflows
Cons: we are breaking our rule of no special tokens in the text

pytorch-bot · 2024-07-02T21:01:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1138

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

Checkout action fails due to incompatible GLIBC

✅ No Failures

As of commit fca9031 with merge base f158577 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

joecummings · 2024-07-02T21:50:24Z

recipes/generate.py

@@ -14,7 +14,7 @@

 from torchtune import config, utils
 from torchtune.config._utils import _get_component_from_path
-from torchtune.data import ChatFormat, InstructTemplate, Message
+from torchtune.data import apply_chat_format, ChatFormat, InstructTemplate, Message


copycat copycat copycat

codecov-commenter · 2024-07-03T02:11:01Z

Codecov Report

Attention: Patch coverage is 94.70199% with 8 lines in your changes missing coverage. Please review.

Project coverage is 66.14%. Comparing base (f158577) to head (7da4189).

Files	Patch %	Lines
tests/test_utils.py	89.65%	3 Missing ⚠️
torchtune/data/_types.py	90.00%	2 Missing ⚠️
torchtune/models/llama3/_tokenizer.py	96.42%	1 Missing ⚠️
torchtune/models/phi3/_tokenizer.py	80.00%	1 Missing ⚠️
torchtune/modules/tokenizers/_utils.py	83.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1138      +/-   ##
==========================================
+ Coverage   65.98%   66.14%   +0.15%     
==========================================
  Files         194      194              
  Lines        9023     9042      +19     
==========================================
+ Hits         5954     5981      +27     
+ Misses       3069     3061       -8

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ebsmothers

Thanks for updating the design based on our discussions. Have a handful of comments but no huge concerns. Can you also fix the doc build job failure and update the PR summary to reflect the design we settled on here?

torchtune/data/_types.py

tests/torchtune/datasets/test_chat_dataset.py

tests/torchtune/models/llama3/test_llama3_tokenizer.py

torchtune/models/llama3/_tokenizer.py

torchtune/data/_chat_formats.py

RdoubleA added 29 commits June 11, 2024 23:49

complete tokenizer refactor

75dae87

move tokenizers under data/

0c20ba9

fix all tests

730a2c9

Merge branch 'main' into tokenizer

acf7e81

start to address comments

2ae157c

load in special tokens, move tokenizer directory back, address comments

6a50cd5

fix encode whitespace

61534d0

updates after manual comparisons

1d6e5e3

default special tokens

5712de4

fix docs

d84bbda

fix doc strings

5a8b82b

Merge branch 'main' into tokenizer

52643cb

fix tests

a00c1dc

fix SP test

29273ca

Merge branch 'main' into tokenizer

adca77e

update api ref

b204563

Merge branch 'main' into tokenizer

e236916

fix llama3 toeknizer test:

93028cf

add image support

fb12cbb

tool support

b5bf410

update tests

00f266f

update tests

c815069

use images as attachments instead

21b3ea8

update all tests

adbfb20

use list of dicts for MM messages

1e40a9d

fix chat formats

0d3665c

run linter

95edf70

Merge branch 'main' into tokenizer_updates

a3067aa

merge main

d49febf

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 2, 2024

RdoubleA changed the title ~~[WIP] Image/tooling support in tokenizers~~ [WIP] Interleaved image support in tokenizers Jul 2, 2024

joecummings reviewed Jul 2, 2024

View reviewed changes

fix chat formats

7da4189

RdoubleA changed the title ~~[WIP] Interleaved image support in tokenizers~~ Interleaved image support in tokenizers Jul 3, 2024

RdoubleA marked this pull request as ready for review July 3, 2024 02:09

ebsmothers reviewed Jul 3, 2024

View reviewed changes

address comments, fix docs

fca9031

ebsmothers approved these changes Jul 4, 2024

View reviewed changes

RdoubleA merged commit b317c8f into pytorch:main Jul 4, 2024
29 checks passed

maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024

Interleaved image support in tokenizers (pytorch#1138)

0fd5da8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interleaved image support in tokenizers #1138

Interleaved image support in tokenizers #1138

RdoubleA commented Jul 2, 2024 •

edited

Loading

pytorch-bot bot commented Jul 2, 2024 •

edited

Loading

joecummings Jul 2, 2024

codecov-commenter commented Jul 3, 2024 •

edited

Loading

ebsmothers left a comment

Interleaved image support in tokenizers #1138

Interleaved image support in tokenizers #1138

Conversation

RdoubleA commented Jul 2, 2024 • edited Loading

Changelog

Test Plan

Other considered approaches

1

2

pytorch-bot bot commented Jul 2, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1138

❗ 1 Active SEVs

✅ No Failures

joecummings Jul 2, 2024

Choose a reason for hiding this comment

codecov-commenter commented Jul 3, 2024 • edited Loading

Codecov Report

ebsmothers left a comment

Choose a reason for hiding this comment

RdoubleA commented Jul 2, 2024 •

edited

Loading

pytorch-bot bot commented Jul 2, 2024 •

edited

Loading

codecov-commenter commented Jul 3, 2024 •

edited

Loading